A Memory Bandwidth-Efficient Hybrid Radix Sort on GPUs

Abstract

Sorting is at the core of many database operations, such as index creation, sort-merge joins, and user-requested output sorting. As GPUs are emerging as a promising platform to accelerate various operations, sorting on GPUs becomes a viable endeavour. Over the past few years, several improvements have been proposed for sorting on GPUs, leading to the first radix sort implementations that achieve a sorting rate of over one billion 32-bit keys per second. Yet, state-of-the-art approaches are heavily memory bandwidth-bound, as they require substantially more memory transfers than their CPU-based counterparts.

Our work proposes a novel approach that almost halves the amount of memory transfers and, therefore, considerably lifts the memory bandwidth limitation. Being able to sort two gigabytes of eight-byte records in as little as 50 milliseconds, our approach achieves a 2.32-fold improvement over the state-of-the-art GPU-based radix sort for uniform distributions, sustaining a minimum speed-up of no less than a factor of 1.66 for skewed distributions.

To address inputs that either do not reside on the GPU or exceed the available device memory, we build on our efficient GPU sorting approach with a pipelined heterogeneous sorting algorithm that mitigates the overhead associated with PCIe data transfers. Comparing the end-to-end sorting performance to the state-of-the-art CPU-based radix sort running 16 threads, our heterogeneous approach achieves a 2.06-fold and a 1.53-fold improvement for sorting 64 GB key-value pairs with a skewed and a uniform distribution, respectively.
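
To make the memory-transfer argument concrete, the following is a minimal, CPU-only sketch of a generic least-significant-digit radix sort. It is not the hybrid algorithm the abstract describes; the function name `lsd_radix_sort` and the parameters `key_bits` and `digit_bits` are assumptions introduced only for illustration. The sketch shows that every digit pass reads and rewrites the whole input, so the number of passes is a rough proxy for memory traffic.

```python
def lsd_radix_sort(keys, key_bits=32, digit_bits=8):
    """Sort non-negative integers below 2**key_bits.

    Illustrative only: returns (sorted_keys, passes), where `passes`
    counts how many times the full input was read and rewritten.
    """
    radix = 1 << digit_bits
    mask = radix - 1
    src = list(keys)
    passes = 0

    for shift in range(0, key_bits, digit_bits):
        # Step 1: histogram of the current digit.
        counts = [0] * radix
        for k in src:
            counts[(k >> shift) & mask] += 1

        # Step 2: exclusive prefix sum gives each bucket's start offset.
        offsets = [0] * radix
        total = 0
        for d in range(radix):
            offsets[d] = total
            total += counts[d]

        # Step 3: stable scatter into the output buffer.
        dst = [0] * len(src)
        for k in src:
            d = (k >> shift) & mask
            dst[offsets[d]] = k
            offsets[d] += 1

        src = dst
        passes += 1

    return src, passes


if __name__ == "__main__":
    data = [170, 45, 75, 90, 802, 24, 2, 66]
    for bits in (4, 8):
        result, passes = lsd_radix_sort(data, digit_bits=bits)
        print(bits, passes, result)
```

In this sketch, widening the digit from 4 to 8 bits cuts the passes over 32-bit keys from 8 to 4. That is one generic way to reduce memory transfers in a radix sort; whether the paper obtains its reduction this way is not stated in the abstract.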
